# Load a package, installing it from CRAN first if needed.
# quietly = TRUE suppresses the startup messages from require().
loadPkg = function(x) {
  if (!require(x, character.only = TRUE, quietly = TRUE)) {
    install.packages(x, dependencies = TRUE, repos = "http://cran.us.r-project.org")
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
}

Chapter 1: Introduction to Cardiovascular Disease

According to the World Health Organization (WHO), cardiovascular diseases (CVDs) are disorders of the heart and blood vessels. They are mainly caused by fatty deposits (plaque) that build up on the inner walls of the blood vessels and prevent blood from flowing to the heart or brain.

The process of fatty plaque formation.

According to a 2016 report, cardiovascular disease remains the leading cause of death in the United States (Benjamin et al., 2019). Around 80% of CVD deaths are due to heart attack and stroke. Cardiovascular disease is usually caused by a combination of risk factors, such as an unhealthy diet, obesity, physical inactivity, tobacco use, and harmful use of alcohol.

Body mass index (BMI) is a value calculated from a person’s weight and height: BMI = weight (kg) / height (m)^2. It is used to estimate a person’s total body fat. Because it needs only weight and height, BMI is widely used in public health and clinical settings.
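As a minimal sketch of the BMI formula, the helper below (a hypothetical name, `bmi_from_cm`) assumes weight in kilograms and height in centimeters, which is how the height values in the Section 2.2 summary appear to be recorded.

```r
# BMI = weight (kg) / height (m)^2; height is assumed to be given in cm.
bmi_from_cm <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100
  weight_kg / height_m^2
}

bmi_from_cm(weight_kg = 72, height_cm = 165)  # about 26.4
```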

Many reports indicate that cardiovascular disease is associated with BMI and lifestyle. We therefore want to evaluate whether these factors truly correlate with the development of the disease.

Chapter 2: Description of Data and Exploratory Data Analysis

2.1 Source Data

The source data for our EDA is a CSV file containing 70,000 patient records with 12 features: age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, physical activity, and presence or absence of cardiovascular disease. (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset)

2.2 Preprocessing Data

The variable ‘age’ is recorded as an integer number of days, so we converted it to integer years. Because height and weight individually say little about a patient’s health, we calculated Body Mass Index (BMI), a measure of body fat based on height and weight that applies to adult men and women, and added it as a feature. The column ‘id’ was dropped.

##       age        gender        height          weight           ap_hi        
##  Min.   :30.00   1:45530   Min.   : 55.0   Min.   : 10.00   Min.   : -150.0  
##  1st Qu.:48.00   2:24470   1st Qu.:159.0   1st Qu.: 65.00   1st Qu.:  120.0  
##  Median :54.00             Median :165.0   Median : 72.00   Median :  120.0  
##  Mean   :53.34             Mean   :164.4   Mean   : 74.21   Mean   :  128.8  
##  3rd Qu.:58.00             3rd Qu.:170.0   3rd Qu.: 82.00   3rd Qu.:  140.0  
##  Max.   :65.00             Max.   :250.0   Max.   :200.00   Max.   :16020.0  
##      ap_lo          cholesterol gluc      smoke     alco      active   
##  Min.   :  -70.00   1:52385     1:59479   0:63831   0:66236   0:13739  
##  1st Qu.:   80.00   2: 9549     2: 5190   1: 6169   1: 3764   1:56261  
##  Median :   80.00   3: 8066     3: 5331                                
##  Mean   :   96.63                                                      
##  3rd Qu.:   90.00                                                      
##  Max.   :11000.00                                                      
##  cardio         bmi         
##  0:35021   Min.   :  3.472  
##  1:34979   1st Qu.: 23.875  
##            Median : 26.374  
##            Mean   : 27.557  
##            3rd Qu.: 30.222  
##            Max.   :298.667
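The preprocessing steps described in this section can be sketched as follows. The data frame `raw` holds a few illustrative rows rather than the real file, and the use of `floor()` with 365.25 days per year is an assumption, since the report does not say how days were converted to years.

```r
# Illustrative stand-in for the raw data; column names follow the summary above.
raw <- data.frame(
  id     = 1:3,
  age    = c(18393, 20228, 18857),  # age in days
  gender = c(2, 1, 1),
  height = c(168, 156, 165),        # cm
  weight = c(62, 85, 64),           # kg
  cardio = c(0, 1, 1)
)

cardio <- raw
cardio$age <- as.integer(floor(cardio$age / 365.25))    # days -> whole years (assumed rule)
cardio$bmi <- cardio$weight / (cardio$height / 100)^2   # kg / m^2
cardio$id  <- NULL                                      # drop the identifier
for (col in c("gender", "cardio")) cardio[[col]] <- factor(cardio[[col]])

summary(cardio)
```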

The minimum values of systolic blood pressure (ap_hi) and diastolic blood pressure (ap_lo) are negative, which is not physically possible. In addition, diastolic blood pressure should be lower than systolic blood pressure. The data were further cleaned based on these criteria.
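A minimal sketch of these two cleaning rules, using a toy data frame `bp` with made-up readings. The exact filter the authors applied is not shown in the report, so the conditions below are an assumption.

```r
bp <- data.frame(
  ap_hi = c(120, -150, 140, 80, 16020),
  ap_lo = c(80, 90, -70, 120, 90)
)

# Keep rows with positive readings where diastolic is below systolic.
ok <- bp$ap_hi > 0 & bp$ap_lo > 0 & bp$ap_lo < bp$ap_hi
bp_clean <- bp[ok, ]
bp_clean
```

Note that an implausibly large value such as 16020 passes both rules, which is why a separate outlier step is still needed.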

Then the distribution of age, height, weight, ap_hi and ap_lo was checked.

The histogram of age shows only a few observations with age<35, too few to represent that population, so observations with age<35 were dropped. For height, weight, ap_hi, and ap_lo, the histograms were heavily skewed by extreme outliers, which were also dropped in this step.

The distributions of age, height, weight, ap_hi and ap_lo were checked again after the outliers were removed.

2.3 Correlations among Different Variables

The correlation matrix was displayed to get an idea of the correlations among different variables.

We noticed that bmi is positively correlated with both ap_hi and ap_lo; ap_hi is positively correlated with age; and ap_hi and ap_lo are, as expected, highly positively correlated.
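A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses synthetic data (the coefficients in the simulation are invented for illustration), not the study data.

```r
set.seed(1)
n <- 200
age   <- round(runif(n, 35, 65))
bmi   <- rnorm(n, 27, 4)
ap_hi <- 90 + 0.4 * age + 0.8 * bmi + rnorm(n, 0, 10)   # synthetic relationship
ap_lo <- 30 + 0.4 * ap_hi + rnorm(n, 0, 6)              # synthetic relationship
num_vars <- data.frame(age, bmi, ap_hi, ap_lo)

cor_mat <- cor(num_vars)   # Pearson correlations
round(cor_mat, 2)
# If the corrplot package is installed: corrplot::corrplot(cor_mat)
```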

Chapter 3: Cardio

3.1 SMART Question

What are the risk factors of cardiovascular disease? Are gender, BMI, cholesterol level, glucose level, smoking, alcohol over-consumption and lack of exercise correlated with the development of cardiovascular disease?

3.2 Basic analysis

First, we categorize BMI values into 5 groups of width 10, starting with the range (1-10) and increasing the bounds by 10 for each subsequent group. We then use the chi-squared test of independence to test for an association between cardiovascular disease and each of BMI, age, cholesterol level, glucose level, smoking behavior, alcohol over-consumption, and lack of exercise.

## 
##  Pearson's Chi-squared test
## 
## data:  cardio_bmi
## X-squared = 1546.2, df = 4, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  cardio_glucose
## X-squared = 476.48, df = 2, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cardio_smoke
## X-squared = 30.402, df = 1, p-value = 3.511e-08
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cardio_alco
## X-squared = 9.4869, df = 1, p-value = 0.002069
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cardio_active
## X-squared = 88.188, df = 1, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  cardio_age
## X-squared = 2974.3, df = 3, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  cardio_cholesterol
## X-squared = 2976.9, df = 2, p-value < 2.2e-16

All p-values are below 0.05 (the largest is 0.002, for alcohol), so the null hypothesis of independence is rejected in every test. All factors listed above are associated with cardiovascular disease and are treated as risk factors.
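The chi-squared tests above follow a common pattern: bin the continuous variable, cross-tabulate it against the outcome, and call `chisq.test()`. The sketch below uses synthetic data, and the break points are an assumption chosen only to illustrate `cut()`.

```r
set.seed(2)
n <- 1000
bmi <- runif(n, 15, 45)
# Synthetic outcome whose probability rises with BMI (illustration only).
cardio <- rbinom(n, 1, plogis(-3 + 0.1 * bmi))

# Bin BMI into groups of width 10; the break points are an assumption.
bmi_group <- cut(bmi, breaks = seq(10, 50, by = 10), right = FALSE)

cardio_bmi <- table(cardio, bmi_group)
test <- chisq.test(cardio_bmi)
test$statistic
test$parameter   # degrees of freedom = (rows - 1) * (cols - 1)
test$p.value
```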

Building on Chapter 2, a bar plot is generated to show the relationship between age and the onset of cardiovascular disease.

As the bar plot shows, more older people than younger people have cardiovascular disease. We discuss how age may affect cardiovascular disease in further detail in Chapter 5.

Chapter 4: BMI

4.1 SMART Question

What is the relationship between BMI and cardiovascular disease, and what factors affect BMI?

4.2 Does BMI affect cardiovascular disease?

A bar plot is generated to show the relationship between BMI group and the onset of cardiovascular disease.

By comparing the BMI group with the incidence of getting cardiovascular disease, we conclude that people with higher BMI are more likely to develop cardiovascular disease. Likewise, people with cardiovascular disease are also more likely to have higher BMI.

In addition, we subset cardiovascular disease group, with [cardio0] for people without cardiovascular disease and [cardio1] for people with cardiovascular disease. Then, we compare the mean and histogram between people with cardiovascular disease and without cardiovascular disease.

## [1] 26.31278
## [1] 27.94742

From both the means and the histograms, we observe that people with cardiovascular disease tend to have a higher BMI than people without it (mean 27.95 vs 26.31). We conclude that BMI and cardiovascular disease are associated.
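The subset-and-compare step can be sketched as below. The data are synthetic, with group means chosen near the values reported above; only the names `cardio0` and `cardio1` come from the report.

```r
set.seed(3)
# Synthetic data: BMI is drawn slightly higher for the cardio = 1 group.
df <- data.frame(
  cardio = rep(c(0, 1), each = 300),
  bmi    = c(rnorm(300, 26.3, 4), rnorm(300, 27.9, 4.5))
)

cardio0 <- subset(df, cardio == 0)
cardio1 <- subset(df, cardio == 1)

mean(cardio0$bmi)
mean(cardio1$bmi)

# t.test() defaults to the Welch test (var.equal = FALSE).
res <- t.test(cardio0$bmi, cardio1$bmi)
res$conf.int
res$p.value
```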

We will go on with the hypothesis that risk factors of high BMI value will also be the risk factors of cardiovascular diseases.

4.3 Basic analysis of BMIgroup

We first use the chi-squared test to examine the relationship between BMIgroup and glucose level, smoking, alcohol consumption, and exercise.

## 
##  Pearson's Chi-squared test
## 
## data:  bmi_glucose
## X-squared = 637.17, df = 8, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  bmi_smoke
## X-squared = 124.54, df = 4, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  bmi_alco
## X-squared = 53.676, df = 4, p-value = 6.153e-11
## 
##  Pearson's Chi-squared test
## 
## data:  bmi_active
## X-squared = 8.5446, df = 4, p-value = 0.07355

The null hypothesis of independence is rejected for glucose level, smoking behavior, and alcohol over-consumption, as their p-values are all far below 0.05. We conclude that BMI group is associated with each of these factors.

As shown in Section 3.2, BMI level is also associated with the presence of cardiovascular disease, as are smoking behavior, alcohol consumption, and glucose level.

For activity level, we fail to reject H0 because the p-value (0.074) is greater than 0.05. The test finds no significant association between BMI group and whether a person is active.

4.4 Conclusion of BMI

We use t-tests and boxplots to compare BMI across the levels of each risk factor. Specifically, we subset women and men for gender, not_smoke (0) and smoke (1) for smoking behavior, noalco (0) and alco (1) for alcohol consumption, and noactive (0) and active (1) for activity level. The boxplots help us find the groups with higher BMI.

## 
##  Welch Two Sample t-test
## 
## data:  cardio0$bmi and cardio1$bmi
## t = -45.412, df = 61629, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.705197 -1.564092
## sample estimates:
## mean of x mean of y 
##  26.31278  27.94742

## 
##  Welch Two Sample t-test
## 
## data:  women$bmi and men$bmi
## t = 36.099, df = 55336, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.196799 1.334223
## sample estimates:
## mean of x mean of y 
##  27.56122  26.29571

## 
##  Welch Two Sample t-test
## 
## data:  not_smoke$bmi and smoke$bmi
## t = 11.943, df = 6773.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5853395 0.8152236
## sample estimates:
## mean of x mean of y 
##  27.18058  26.48030

## 
##  Welch Two Sample t-test
## 
## data:  noalco$bmi and alco$bmi
## t = -2.8869, df = 3647, p-value = 0.003914
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.38919138 -0.07436675
## sample estimates:
## mean of x mean of y 
##  27.10803  27.33981

## 
##  Welch Two Sample t-test
## 
## data:  noactive$bmi and active$bmi
## t = 2.1267, df = 18396, p-value = 0.03346
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.007756491 0.190271227
## sample estimates:
## mean of x mean of y 
##  27.19976  27.10075

We conclude that non-smokers, people who drink alcohol, and women have a higher mean BMI. Inactive people also have a slightly higher mean BMI than active people (27.20 vs 27.10, p = 0.033), but the difference is small. Because cardiovascular disease is associated with BMI, not smoking, drinking alcohol, and being female may also be associated with cardiovascular disease through BMI; this is a hypothesis rather than a result of these tests.

Chapter 5: Age

5.1 SMART Question

Are the mean values of factors such as systolic and diastolic blood pressure the same across age groups?

5.2 Is systolic blood pressure the same across all age groups?

We noted above that age and cardiovascular disease are related. Here we examine age and blood pressure.

H0: The mean values of systolic blood pressure are the same across all age groups.

H1: The mean values of systolic blood pressure are different across age groups.

ANOVA and TukeyHSD are used to test the hypothesis and calculate the p-values. The output below summarizes the results.

##                Df   Sum Sq Mean Sq F value Pr(>F)    
## ageGroup        3   454301  151434   769.1 <2e-16 ***
## Residuals   62495 12305612     197                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ap_hi ~ ageGroup, data = cardio)
## 
## $ageGroup
##                   diff       lwr      upr     p adj
## 45-54-35-44  4.0170952  3.569310 4.464880 0.0000000
## 55-64-35-44  7.7224200  7.281768 8.163072 0.0000000
## 65-74-35-44  7.5728352  5.527256 9.618414 0.0000000
## 55-64-45-54  3.7053248  3.392737 4.017912 0.0000000
## 65-74-45-54  3.5557400  1.533877 5.577603 0.0000370
## 65-74-55-64 -0.1495848 -2.169880 1.870710 0.9975614

The ANOVA null hypothesis is rejected because the p-value is very small (F = 769.1, p < 2e-16). We conclude that the mean systolic blood pressure is not the same across all age groups.

Based on TukeyHSD, every pair of age groups differs significantly except (55-64) and (65-74), whose means are not significantly different (adjusted p = 0.998).
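The ANOVA followed by TukeyHSD can be sketched with `aov()` and `TukeyHSD()`. The data below are synthetic, with the two oldest groups deliberately given the same mean; only the names `ageGroup` and `ap_hi` are taken from the output above.

```r
set.seed(4)
age_levels <- c("35-44", "45-54", "55-64", "65-74")
df <- data.frame(
  ageGroup = factor(rep(age_levels, each = 100), levels = age_levels)
)
# Synthetic systolic pressure: the last two groups share the same mean.
group_mean <- c(120, 124, 128, 128)[as.integer(df$ageGroup)]
df$ap_hi <- rnorm(nrow(df), mean = group_mean, sd = 14)

fit <- aov(ap_hi ~ ageGroup, data = df)
summary(fit)              # overall F test

tukey <- TukeyHSD(fit)    # all pairwise comparisons, family-wise 95% level
tukey$ageGroup[, "p adj"]
```

The overall F test only says that at least one group mean differs; the pairwise adjusted p-values show which pairs do.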

5.3 Is diastolic blood pressure the same across all age groups?

H0: The mean values of diastolic blood pressure are the same across all age groups.

H1: The mean values of diastolic blood pressure are different across age groups.

ANOVA and TukeyHSD are used to test the hypothesis and calculate the p-values. The output below summarizes the results.

##                Df  Sum Sq Mean Sq F value Pr(>F)    
## ageGroup        3   70521   23507   407.1 <2e-16 ***
## Residuals   62495 3608599      58                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ap_lo ~ ageGroup, data = cardio)
## 
## $ageGroup
##                   diff         lwr       upr     p adj
## 45-54-35-44  1.8768690  1.63438260 2.1193553 0.0000000
## 55-64-35-44  3.1436368  2.90501313 3.3822604 0.0000000
## 65-74-35-44  3.0355532  1.92782310 4.1432833 0.0000000
## 55-64-45-54  1.2667678  1.09749421 1.4360414 0.0000000
## 65-74-45-54  1.1586842  0.06379689 2.2535715 0.0331826
## 65-74-55-64 -0.1080836 -1.20212193 0.9859547 0.9942677

The ANOVA null hypothesis is rejected because the p-value is very small (F = 407.1, p < 2e-16). We conclude that the mean diastolic blood pressure is not the same across all age groups.

Based on TukeyHSD, every pair of age groups differs significantly except (55-64) and (65-74), whose means are not significantly different (adjusted p = 0.994).

We conclude that blood pressure is associated with age: both systolic and diastolic means rise from the 35-44 group to the 55-64 group. High blood pressure increases the incidence of cardiovascular disease.

Chapter 6: Cardiovascular Disease Prediction Method

Being overweight or obese substantially increases the risk of developing cardiovascular disease. However, researchers do not always agree on which method is best for quantifying whether an individual is “too” overweight. This section therefore analyzes whether BMI is the best measurement for predicting risk.

6.1 SMART Question

Is BMI the best measurement at predicting risk in every individual? If not, is there another measurement to help to predict the risk of cardiovascular disease?

6.2 Cardio in Different BMI Group

To find the adult weight classification, see which of these BMI ranges the weight falls into:

BMI (kg/m^2)   Adult weight classification
[0, 18.5)      underWeight
[18.5, 25)     normalWeight
[25, 30)       overWeight
[30, 35)       obese
[35, 45)       severelyObese
[45, 50)       morbidlyObese
[50, Inf)      superObese
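The classification table maps directly onto `cut()` with left-closed intervals. The helper name `classify_bmi` is hypothetical; the breaks and labels are those of the table.

```r
bmi_breaks <- c(0, 18.5, 25, 30, 35, 45, 50, Inf)
bmi_labels <- c("underWeight", "normalWeight", "overWeight", "obese",
                "severelyObese", "morbidlyObese", "superObese")

classify_bmi <- function(bmi) {
  # right = FALSE gives left-closed intervals such as [18.5, 25)
  cut(bmi, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)
}

classify_bmi(c(17.2, 21.97, 25, 34.93, 47, 52))
```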

The bar chart presents the incidence of cardiovascular disease in groups with different BMI levels. The figure shows that, below the severelyObese level, the incidence of cardiovascular disease rises gradually as BMI increases. Above the severelyObese level, the incidence levels off and trends slowly downward. However, BMI is not accurate for every individual: it overestimates body fat in people with a lot of muscle mass and tends to underestimate it in elderly people. This motivates using waist circumference as an additional risk predictor.

6.3 Predicted Waist Circumference by BMI

Carrying excess body fat around your middle is more of a health risk than if weight is on your hips and thighs. In that case, waist circumference is a better estimate of visceral fat, the dangerous internal fat that coats the organs.

An initial model expressed the regression of WC on BMI in the following form:

\(WC_i = b_0 + b_1 BMI_i + b_2 AGE_i + b_3 BLACK_i + b_4 HISP_i + e_i\)

where \(i\) indexes individuals, \(WC_i\) is waist circumference for individual \(i\), \(BMI_i\) is body mass index, \(AGE_i\) is current age (in years), \(BLACK_i\) is an indicator for African-American, \(HISP_i\) is an indicator for Hispanic ethnicity, and \(e_i\) is the residual.

For women the pattern was better summarized by using one constant for age<35 years and a separate intercept and slope for age≥35 years. Thus, the model for women was

\(WC_i = c_0 + c_1 BMI_i + c_2 I\{AGE_i \ge 35\} + c_3 AGE_i \times I\{AGE_i \ge 35\} + c_4 BLACK_i + c_5 HISP_i + e_i\)

where \(I\{B\}\) is an indicator function: \(I\{B\} = 1\) when \(B\) is true and 0 otherwise.
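The two model equations translate into code as follows. This is only a sketch of their structure: the function names are hypothetical, and the coefficient values in `demo_linear` and `demo_women` are made up for illustration, because the report does not list the fitted coefficients.

```r
# 'coefs' must be supplied by the reader; none are given in this report.
predict_waist_linear <- function(bmi, age, black, hisp, coefs) {
  coefs[["b0"]] + coefs[["b1"]] * bmi + coefs[["b2"]] * age +
    coefs[["b3"]] * black + coefs[["b4"]] * hisp
}

predict_waist_women <- function(bmi, age, black, hisp, coefs) {
  older <- as.numeric(age >= 35)   # indicator I{AGE >= 35}
  coefs[["c0"]] + coefs[["c1"]] * bmi + coefs[["c2"]] * older +
    coefs[["c3"]] * age * older + coefs[["c4"]] * black + coefs[["c5"]] * hisp
}

# Made-up coefficients, for illustration only.
demo_linear <- c(b0 = 20, b1 = 2.5, b2 = 0.1, b3 = -2, b4 = -1)
demo_women  <- c(c0 = 25, c1 = 2.2, c2 = -3, c3 = 0.1, c4 = -2, c5 = -1)

predict_waist_linear(bmi = 27, age = 50, black = 0, hisp = 0, coefs = demo_linear)
predict_waist_women(bmi = 27, age = 30, black = 0, hisp = 0, coefs = demo_women)
```

For a woman under 35 the indicator is 0, so the age terms drop out and only the constant and BMI terms contribute.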

After the prediction, the sample of the data is as follows:

##   gender      bmi predict_waist
## 1      2 21.96712      85.90547
## 2      1 34.92768     108.87168
## 3      1 23.50781      83.16440
## 4      2 28.71048     102.58695
## 6      1 29.38468      97.20714

6.4 Cut Off Line and Obese

Studies have shown that a waist circumference of 95 cm or more in men, and of 88 cm or more in women, is associated with elevated cardiovascular risk. We therefore use these values as the cut off line for each gender.

In this section, a Body Mass Index of 25 kg/m^2 or more (overWeight and above in the Section 6.2 classification) is labeled obese, which means the risk of having cardiovascular disease is higher.

Two result categories are also defined here: “safe area” and “warning area”. When waist circumference is below the cut off line and BMI is below the obese threshold, the result is safe area; otherwise the result is warning area.
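The safe/warning rule can be written as a small function. The name `classify_area` is hypothetical, and the gender coding (1 = women, 2 = men) is an assumption inferred from the group sizes reported elsewhere in this report.

```r
classify_area <- function(gender, bmi, waist_cm) {
  # Gender coding assumed: 1 = women, 2 = men.
  waist_cutoff <- ifelse(gender == 2, 95, 88)
  over_cutoff  <- waist_cm >= waist_cutoff
  obese        <- bmi >= 25
  ifelse(!over_cutoff & !obese, "safe area", "warning area")
}

classify_area(gender   = c(2, 1, 1, 2),
              bmi      = c(21.97, 34.93, 23.51, 28.71),
              waist_cm = c(85.91, 108.87, 83.16, 102.59))
```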

The statistical results of cardiovascular disease after cut off and obese classification are as follows:

gender  obese          cut_off        bmi_waist     cardio  n
1       normal weight  below cut off  safe area     0       8920
1       normal weight  below cut off  safe area     1       6057
1       normal weight  over cut off   warning area  1       2
1       obese          below cut off  warning area  0       1285
1       obese          below cut off  warning area  1       785
1       obese          over cut off   warning area  0       10329
1       obese          over cut off   warning area  1       13337
2       normal weight  below cut off  safe area     0       5383
2       normal weight  below cut off  safe area     1       3579
2       normal weight  over cut off   warning area  0       47
2       normal weight  over cut off   warning area  1       83
2       obese          below cut off  warning area  0       648
2       obese          below cut off  warning area  1       376
2       obese          over cut off   warning area  0       5019
2       obese          over cut off   warning area  1       6649

6.5 Cut Off Line, Obese, Cardiovascular Disease

  • The bar plot below shows the relationship between obesity and cardiovascular disease morbidity in all genders, men and women.

As the counts in the Section 6.4 table confirm, the percentages in this section are shares of all people in the stated gender group who fall in a category and have cardiovascular disease; they are not the risk within that category. From the three obesity vs cardiovascular disease bar plots, 15.6% of all participants are normal weight with cardiovascular disease, and 33.8% are obese with cardiovascular disease. Among women, the shares are 14.9% (normal weight) and 34.7% (obese). Among men, they are 16.8% (normal weight) and 32.2% (obese).

  • The bar plot below shows the relationship between cut off line and cardiovascular disease morbidity in all genders, men and women.

From the three cut off line vs cardiovascular disease bar plots, 17.3% of all participants have a waist circumference below the cut off line and cardiovascular disease, and 32.1% have a waist circumference over the cut off line and cardiovascular disease. Among women, the shares are 16.8% (below) and 32.8% (over). Among men, they are 18.2% (below) and 30.9% (over).

  • The bar plot below shows the relationship between safe/warning area and cardiovascular disease morbidity in all genders, men and women.

From the three safe/warning area vs cardiovascular disease bar plots, 15.4% of all participants are in the safe area with cardiovascular disease, and 34% are in the warning area with cardiovascular disease. Among women, the shares are 14.9% (safe area) and 34.7% (warning area). Among men, they are 16.4% (safe area) and 32.6% (warning area).
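The percentages quoted in this section can be reproduced from the counts in the Section 6.4 table as each cell's share of the gender's total. The sketch below does this for gender 1 (women) with the obese grouping, and also computes the risk within each weight class, which is a different quantity. The variable names are hypothetical; the counts are copied from the table.

```r
# Counts for gender 1 from the Section 6.4 table, collapsed over cut_off.
women <- data.frame(
  obese  = c("normal weight", "normal weight", "obese", "obese"),
  cardio = c(0, 1, 0, 1),
  n      = c(8920, 6057 + 2, 1285 + 10329, 785 + 13337)
)

total    <- sum(women$n)
with_cvd <- women[women$cardio == 1, ]

# Share of all women in each (weight class, CVD = 1) cell.
joint_share <- with_cvd$n / total
names(joint_share) <- with_cvd$obese
round(100 * joint_share, 1)   # 14.9 and 34.7, matching the text

# Risk within each weight class: CVD cases divided by the size of that class.
class_size <- tapply(women$n, women$obese, sum)
within_class_risk <- with_cvd$n / class_size[with_cvd$obese]
round(100 * within_class_risk, 1)
```

The within-class figures answer the question "of the people in this weight class, what fraction have cardiovascular disease?", which is usually what is meant by risk.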

6.6 Conclusion

  • From the three measurements above (Cut Off Line vs Cardiovascular Disease; Obese vs Cardiovascular Disease; Cut Off Line + Obese vs Cardiovascular Disease), all three help predict the risk of cardiovascular disease. However, the Cut Off Line + Obese measurement performs best.
  • If you only have your weight and height and want a simple prediction, you can calculate your BMI and use the Obese vs Cardiovascular Disease measurement. This method is not the most accurate, but it requires only simple data and follows the same overall trend.
  • If you want a more accurate prediction of the risk, also measure your waist circumference along with your height and weight.

Chapter 7: Model building and evaluation

7.1 SMART Question

What are the risk factors of cardiovascular diseases? Are all the variables correlated to the development of cardiovascular disease?

7.2.1 Basic analysis

## 
## Call:
## glm(formula = cardio ~ gender + age + ap_hi + ap_lo + cholesterol + 
##     bmi + gluc + smoke + alco + active, family = "binomial", 
##     data = cardio)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0174  -0.9166  -0.3889   0.9299   2.5614  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -12.512389   0.141718 -88.291  < 2e-16 ***
## gender2        0.019639   0.020592   0.954    0.340    
## age            0.050466   0.001418  35.596  < 2e-16 ***
## ap_hi          0.062213   0.001047  59.435  < 2e-16 ***
## ap_lo          0.015256   0.001765   8.645  < 2e-16 ***
## cholesterol2   0.361843   0.028912  12.515  < 2e-16 ***
## cholesterol3   1.088438   0.037622  28.931  < 2e-16 ***
## bmi            0.029144   0.002137  13.638  < 2e-16 ***
## gluc2          0.007199   0.038427   0.187    0.851    
## gluc3         -0.319649   0.041611  -7.682 1.57e-14 ***
## smoke1        -0.158285   0.036725  -4.310 1.63e-05 ***
## alco1         -0.217539   0.044819  -4.854 1.21e-06 ***
## active1       -0.238012   0.022943 -10.374  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 86633  on 62498  degrees of freedom
## Residual deviance: 70251  on 62486  degrees of freedom
## AIC: 70277
## 
## Number of Fisher Scoring iterations: 4

All the coefficients except gender2 and gluc2 are significant (small p-values). Because the other glucose level, gluc3, is significant, gluc is kept as a factor, and only gender is dropped.

## 
## Call:
## glm(formula = cardio ~ age + ap_hi + ap_lo + cholesterol + bmi + 
##     gluc + smoke + alco + active, family = "binomial", data = cardio)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0186  -0.9168  -0.3893   0.9302   2.5593  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -12.505617   0.141526 -88.362  < 2e-16 ***
## age            0.050446   0.001418  35.585  < 2e-16 ***
## ap_hi          0.062242   0.001046  59.486  < 2e-16 ***
## ap_lo          0.015304   0.001764   8.676  < 2e-16 ***
## cholesterol2   0.360920   0.028895  12.491  < 2e-16 ***
## cholesterol3   1.087543   0.037610  28.916  < 2e-16 ***
## bmi            0.028883   0.002119  13.628  < 2e-16 ***
## gluc2          0.007304   0.038427   0.190    0.849    
## gluc3         -0.319628   0.041613  -7.681 1.58e-14 ***
## smoke1        -0.147992   0.035104  -4.216 2.49e-05 ***
## alco1         -0.214920   0.044734  -4.804 1.55e-06 ***
## active1       -0.238103   0.022942 -10.378  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 86633  on 62498  degrees of freedom
## Residual deviance: 70252  on 62487  degrees of freedom
## AIC: 70276
## 
## Number of Fisher Scoring iterations: 4
## 
##                 GVIF Df GVIF^(1/(2*Df))
## gender      1.152808  1        1.073689
## age         1.015821  1        1.007880
## ap_hi       1.748779  1        1.322414
## ap_lo       1.729927  1        1.315267
## cholesterol 1.500081  2        1.106697
## bmi         1.063761  1        1.031388
## gluc        1.483287  2        1.103586
## smoke       1.244471  1        1.115559
## alco        1.139847  1        1.067636
## active      1.002560  1        1.001279

Here we use GVIF to check whether collinearity is a problem in this logistic regression model. GVIF generalizes VIF to terms that require more than one coefficient, and thus more than 1 degree of freedom, such as factors and polynomial variables. For one-coefficient terms, VIF equals GVIF. We apply the rule GVIF^(1/(2×Df)) < 2, which is equivalent to a VIF of 4 for one-coefficient variables. All values in the table are below 1.33, so collinearity is not a problem in our logistic regression model.
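The model-building steps in this section (fit the full model, drop a non-significant term, compare) can be sketched with base R's `glm()`. The data are synthetic, with an outcome that by construction does not depend on gender; the variable names mirror the report but the numbers do not.

```r
set.seed(5)
n <- 2000
df <- data.frame(
  age    = round(runif(n, 35, 65)),
  ap_hi  = rnorm(n, 128, 15),
  gender = factor(sample(1:2, n, replace = TRUE))
)
# Synthetic outcome: depends on age and ap_hi but not on gender.
df$cardio <- rbinom(n, 1, plogis(-10 + 0.05 * df$age + 0.06 * df$ap_hi))

full    <- glm(cardio ~ gender + age + ap_hi, family = binomial, data = df)
reduced <- update(full, . ~ . - gender)

summary(full)$coefficients             # estimates, z values, p-values
anova(reduced, full, test = "Chisq")   # likelihood-ratio test for gender
exp(coef(reduced))                     # odds ratios per unit increase
# If the car package is installed, car::vif(full) reports GVIF per term.
```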

7.2.2 Hosmer and Lemeshow test

The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.

## Warning in Ops.factor(1, y): '-' not meaningful for factors

The result is shown here:

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  cardio$cardio, fitted(cardiologit)
## X-squared = 62499, df = 8, p-value < 2.2e-16

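The p-value is smaller than 0.05. In the Hosmer and Lemeshow test the null hypothesis is that the model fits well, so a small p-value indicates a poor fit rather than a good one. However, this result should not be relied on: the warning above shows that the response was passed as a factor, and the reported statistic (62499) equals the number of observations. The test should be rerun with a numeric 0/1 response.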
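The warning about factors above suggests the response needs to be numeric before it is passed to the test. The sketch below converts a factor response to 0/1 and computes the Hosmer and Lemeshow statistic by hand in base R. The function `hosmer_lemeshow` is a hypothetical helper written for illustration on synthetic data, not the ResourceSelection implementation.

```r
hosmer_lemeshow <- function(y, p, g = 10) {
  stopifnot(is.numeric(y), all(y %in% c(0, 1)))
  # Group observations by quantiles of the fitted probability.
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp <- cut(p, breaks = breaks, include.lowest = TRUE)
  observed <- tapply(y, grp, sum)    # observed events per group
  expected <- tapply(p, grp, sum)    # expected events per group
  size     <- tapply(y, grp, length)
  stat <- sum((observed - expected)^2 / (expected * (1 - expected / size)))
  df   <- length(observed) - 2
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(6)
x <- rnorm(500)
cardio <- factor(rbinom(500, 1, plogis(0.8 * x)))   # response stored as a factor
fit <- glm(cardio ~ x, family = binomial)

y_num <- as.numeric(as.character(cardio))           # factor -> numeric 0/1
hl <- hosmer_lemeshow(y_num, fitted(fit))
hl
# With the package installed, the equivalent call would be
# ResourceSelection::hoslem.test(y_num, fitted(fit), g = 10)
```

A large p-value here means there is no evidence of lack of fit; a small one points to a poorly calibrated model.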

7.2.3 ROC curve and AUC

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7874

The area under the curve is 0.787, slightly below 0.8, a commonly used threshold for good discrimination. By this measure the model discriminates acceptably but not strongly.
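The AUC has a simple interpretation: it is the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it from ranks in base R; `auc_rank` is a hypothetical helper, not the pROC implementation.

```r
auc_rank <- function(y, score) {
  # Probability that a random case scores higher than a random control,
  # computed from ranks (ties get average ranks).
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_rank(y = c(0, 0, 1, 1), score = c(0.1, 0.4, 0.35, 0.8))  # 0.75
# With the pROC package installed: pROC::auc(pROC::roc(y, score))
```

In the toy example, three of the four case/control pairs are ordered correctly, which gives 0.75.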

7.2.4 McFadden

##           llh       llhNull            G2      McFadden          r2ML 
## -3.512576e+04 -4.331635e+04  1.638118e+04  1.890877e-01  2.305682e-01 
##          r2CU 
##  3.074396e-01

With a McFadden value of 0.189, which is loosely analogous to the coefficient of determination R\(^2\), the explanatory variables improve the log-likelihood by about 18.9% relative to the intercept-only model.

Taken together, these model evaluations suggest that this logistic regression is an acceptable, though not strong, model.
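The McFadden value follows directly from the two log-likelihoods printed above (llh and llhNull), as does the G2 statistic. The variable names in this sketch are hypothetical.

```r
llh      <- -3.512576e+04   # log-likelihood of the fitted model
llh_null <- -4.331635e+04   # log-likelihood of the intercept-only model

mcfadden <- 1 - llh / llh_null
round(mcfadden, 4)          # 0.1891

g2 <- -2 * (llh_null - llh) # likelihood-ratio statistic, about 16381.18
g2
```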

Chapter 8: Ridge and Lasso Regression

8.1 Preparation: Standardize the numerical variables

When we perform Ridge or Lasso regression, standardizing the variables (z-scores are a typical choice) is very important, because the penalty depends on the scale of each coefficient.

Here we introduce a function uzsale(df, append=0, excl=NULL), which converts all numerical variables to their respective z-scores. Base R can do that too, but this function is also safe with categorical variables and adds some options.
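The report does not show the body of the standardization function, so the sketch below is a hypothetical stand-in named `zscale_numeric`, not the authors' implementation. It converts numeric columns to z-scores and leaves factors untouched; its `excl` argument is assumed to list columns to skip.

```r
zscale_numeric <- function(df, excl = NULL) {
  # Standardize numeric columns to mean 0 and sd 1; leave other columns as-is.
  num_cols <- setdiff(names(df)[sapply(df, is.numeric)], excl)
  for (col in num_cols) {
    df[[col]] <- (df[[col]] - mean(df[[col]])) / sd(df[[col]])
  }
  df
}

toy <- data.frame(
  age   = c(40, 50, 60),
  smoke = factor(c(0, 1, 0)),
  bmi   = c(22, 27, 32)
)
z <- zscale_numeric(toy)
z
```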

8.2 Ridge Regression

## Warning: Column `age` has different attributes on LHS and RHS of join
## Warning: Column `ap_hi` has different attributes on LHS and RHS of join
## Warning: Column `ap_lo` has different attributes on LHS and RHS of join
## Warning: Column `bmi` has different attributes on LHS and RHS of join
## 

## [1] 0.02169438
##  (Intercept)  (Intercept)          age        ap_hi        ap_lo cholesterol2 
##  -0.07049091   0.00000000   0.25298104   0.72162713   0.19205375   0.31928457 
## cholesterol3        gluc2        gluc3       smoke1        alco1      active1 
##   0.92053303   0.03557552  -0.18581276  -0.13156551  -0.18546758  -0.20772134 
##          bmi 
##   0.11450625
##  (Intercept)          age        ap_hi        ap_lo cholesterol2 cholesterol3 
##  -0.07049091   0.25298104   0.72162713   0.19205375   0.31928457   0.92053303 
##        gluc2        gluc3       smoke1        alco1      active1          bmi 
##   0.03557552  -0.18581276  -0.13156551  -0.18546758  -0.20772134   0.11450625

The best λ for Ridge regression is 0.0217, close to 0, so the fit is close to the unpenalized one. Ridge keeps all the coefficients, and the small amount of shrinkage is consistent with the GVIF result in Section 7.2.1 that there is no multicollinearity among the predictors.

8.3 Lasso Regression

## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

## [1] 0.0008964143
##  (Intercept)  (Intercept)          age        ap_hi        ap_lo cholesterol2 
## -0.009655563  0.000000000  0.296961528  0.846433715  0.088783375  0.193588280 
## cholesterol3        gluc2        gluc3       smoke1        alco1      active1 
##  0.759340280  0.000000000  0.000000000 -0.015708352 -0.015541297 -0.109079165 
##          bmi 
##  0.099056559
##  (Intercept)          age        ap_hi        ap_lo cholesterol2 cholesterol3 
## -0.009655563  0.296961528  0.846433715  0.088783375  0.193588280  0.759340280 
##       smoke1        alco1      active1          bmi 
## -0.015708352 -0.015541297 -0.109079165  0.099056559

The best λ for Lasso regression is 0.0009, almost 0, so the fit is close to the unpenalized one and agrees with Ridge regression. Lasso is very similar to Ridge, but it often forces some coefficients to be exactly zero, which makes it a useful feature selection tool as well. In this case, both gluc coefficients (gluc2 and gluc3) are set to zero; gender was not included in this model. This matches the logistic regression p-value for gluc2 (0.849) but not for gluc3, which was significant in that model.
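A cross-validated Ridge and Lasso fit of the kind described in this chapter can be sketched with `glmnet::cv.glmnet()`. The data are synthetic, and the block only fits the models if the glmnet package is installed. Dropping the intercept column from `model.matrix()` is a deliberate choice here, since leaving it in is one likely source of a duplicated (Intercept) entry like the one in the coefficient listings above.

```r
set.seed(7)
n <- 500
df <- data.frame(
  age   = rnorm(n),
  ap_hi = rnorm(n),
  noise = rnorm(n),
  gluc  = factor(sample(1:3, n, replace = TRUE))
)
# Synthetic outcome: depends on age and ap_hi only.
y <- rbinom(n, 1, plogis(0.3 * df$age + 0.8 * df$ap_hi))

# glmnet needs a numeric matrix; model.matrix() expands factors into dummies.
x <- model.matrix(~ age + ap_hi + noise + gluc, data = df)[, -1]

if (requireNamespace("glmnet", quietly = TRUE)) {
  ridge <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 0)  # alpha = 0: ridge
  lasso <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1: lasso
  print(ridge$lambda.min)
  print(coef(lasso, s = "lambda.min"))   # zeroed coefficients print as "."
}
```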

Chapter 9: KNN

## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  18750 
## 
##  
##                   | cardio_1NN 
## cardio.testLabels |         0 |         1 | Row Total | 
## ------------------|-----------|-----------|-----------|
##                 0 |      5928 |      3619 |      9547 | 
##                   |     0.621 |     0.379 |     0.509 | 
##                   |     0.630 |     0.387 |           | 
##                   |     0.316 |     0.193 |           | 
## ------------------|-----------|-----------|-----------|
##                 1 |      3475 |      5728 |      9203 | 
##                   |     0.378 |     0.622 |     0.491 | 
##                   |     0.370 |     0.613 |           | 
##                   |     0.185 |     0.305 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |      9403 |      9347 |     18750 | 
##                   |     0.501 |     0.499 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  18750 
## 
##  
##                   | cardio_3NN 
## cardio.testLabels |         0 |         1 | Row Total | 
## ------------------|-----------|-----------|-----------|
##                 0 |      6365 |      3182 |      9547 | 
##                   |     0.667 |     0.333 |     0.509 | 
##                   |     0.660 |     0.349 |           | 
##                   |     0.339 |     0.170 |           | 
## ------------------|-----------|-----------|-----------|
##                 1 |      3273 |      5930 |      9203 | 
##                   |     0.356 |     0.644 |     0.491 | 
##                   |     0.340 |     0.651 |           | 
##                   |     0.175 |     0.316 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |      9638 |      9112 |     18750 | 
##                   |     0.514 |     0.486 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  18750 
## 
##  
##                   | cardio_5NN 
## cardio.testLabels |         0 |         1 | Row Total | 
## ------------------|-----------|-----------|-----------|
##                 0 |      6605 |      2942 |      9547 | 
##                   |     0.692 |     0.308 |     0.509 | 
##                   |     0.676 |     0.328 |           | 
##                   |     0.352 |     0.157 |           | 
## ------------------|-----------|-----------|-----------|
##                 1 |      3169 |      6034 |      9203 | 
##                   |     0.344 |     0.656 |     0.491 | 
##                   |     0.324 |     0.672 |           | 
##                   |     0.169 |     0.322 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |      9774 |      8976 |     18750 | 
##                   |     0.521 |     0.479 |           | 
## ------------------|-----------|-----------|-----------|
## 
## 
## [1] 0.6216533
## [1] 0.6557333
## [1] 0.67408

The accuracy is 62.2% for k=1, 65.6% for k=3, and 67.4% for k=5.
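These accuracies can be recomputed directly from the cross tables. The sketch below copies the k=1 counts from the CrossTable output; the helper name `accuracy` is our own and only base R is used.

```r
# Confusion matrix for k = 1, copied from the CrossTable output.
# Rows are the true labels, columns are the predicted labels.
conf_1nn <- matrix(
  c(5928, 3619,
    3475, 5728),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

print(accuracy(conf_1nn))  # 0.6216533
```

The same function applied to the k=3 and k=5 tables reproduces the other two values.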

Selecting the correct “k”

How does “k” affect classification accuracy? Let’s create a function that calculates classification accuracy for a given value of “k.”

##  num [1:2, 1:6] 1 0.622 3 0.656 5 ...
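A function of this kind could look like the sketch below. It is not the original chunk: it assumes the `class` package's `knn()` function, uses the built-in `iris` data as a stand-in for our cardio data, and all function and argument names are illustrative. The grid of k values (1, 3, 5, ... with six entries) is inferred from the `str()` output and is an assumption.

```r
library(class)  # knn(); a recommended package shipped with R

# Classification accuracy of k-nearest neighbours for a single k.
knn_accuracy <- function(k, train_x, test_x, train_y, test_y) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
}

# Stand-in data: the built-in iris set, scaled, with a 75/25 split.
set.seed(1)
x <- scale(iris[, 1:4])
y <- iris$Species
train_idx <- sample(nrow(x), size = round(0.75 * nrow(x)))

# Assumed grid of k values (six odd numbers starting at 1).
k_values <- c(1, 3, 5, 7, 9, 11)
acc <- sapply(k_values, knn_accuracy,
              train_x = x[train_idx, ], test_x = x[-train_idx, ],
              train_y = y[train_idx], test_y = y[-train_idx])

# 2 x 6 matrix: first row is k, second row is the accuracy.
results <- rbind(k = k_values, accuracy = acc)
str(results)
```

Plotting `acc` against `k_values` is one way to look for the point where adding neighbours stops paying off.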

It seems that k=7 is an efficient choice: accuracy improves noticeably up to that point, and the incremental gains trail off afterwards. The accuracy for k=7 is 68.5%.

Chapter 10: Decision Tree

library(rpart) # provides rpart(), printcp(), plotcp()
cardiodtfit <- rpart(cardio ~ age + gender + ap_hi + ap_lo + cholesterol + bmi + gluc + smoke + alco + active, method="class", data=cardio)
printcp(cardiodtfit) # display the results 
## 
## Classification tree:
## rpart(formula = cardio ~ age + gender + ap_hi + ap_lo + cholesterol + 
##     bmi + gluc + smoke + alco + active, data = cardio, method = "class")
## 
## Variables actually used in tree construction:
## [1] age         ap_hi       cholesterol
## 
## Root node error: 30868/62499 = 0.4939
## 
## n= 62499 
## 
##        CP nsplit rel error  xerror      xstd
## 1 0.40926      0   1.00000 1.00000 0.0040492
## 2 0.01001      1   0.59074 0.59074 0.0036816
## 3 0.01000      3   0.57072 0.58034 0.0036622
plotcp(cardiodtfit) # visualize cross-validation results 
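The complexity table printed by printcp() can also be used programmatically to choose a cp value and prune the tree. The following is a sketch under stated assumptions rather than the procedure used in this report: it fits a stand-in tree on rpart's built-in `kyphosis` data, picks the cp with the lowest cross-validated error, and prunes to it. The helper `n_leaves` is our own.

```r
library(rpart)  # also provides the kyphosis example data

# Stand-in model on the kyphosis data (not our cardio data).
# cp = 0 grows a larger tree so that pruning has something to remove.
set.seed(1)  # cross-validated xerror depends on random fold assignment
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", control = rpart.control(cp = 0))

# cptable holds the same columns that printcp() prints.
cp_table <- fit$cptable

# Pick the cp value whose cross-validated error (xerror) is lowest.
best_row <- which.min(cp_table[, "xerror"])
best_cp <- cp_table[best_row, "CP"]

# Prune back to that complexity; the result is never larger than `fit`.
pruned <- prune(fit, cp = best_cp)

# Count terminal nodes: rpart marks them as "<leaf>" in the frame.
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat("leaves before:", n_leaves(fit), " after:", n_leaves(pruned), "\n")
```

Choosing the minimum xerror is only one convention; a common alternative is the smallest tree within one standard error (xstd) of that minimum.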

summary(cardiodtfit) # detailed summary of splits
## Call:
## rpart(formula = cardio ~ age + gender + ap_hi + ap_lo + cholesterol + 
##     bmi + gluc + smoke + alco + active, data = cardio, method = "class")
##   n= 62499 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.40925878      0 1.0000000 1.0000000 0.004049167
## 2 0.01001037      1 0.5907412 0.5907412 0.003681571
## 3 0.01000000      3 0.5707205 0.5803421 0.003662230
## 
## Variable importance
##       ap_hi       ap_lo         age cholesterol         bmi        gluc 
##          49          29           8           8           5           1 
## 
## Node number 1: 62499 observations,    complexity param=0.4092588
##   predicted class=0  expected loss=0.4938959  P(node) =1
##     class counts: 31631 30868
##    probabilities: 0.506 0.494 
##   left son=2 (37556 obs) right son=3 (24943 obs)
##   Primary splits:
##       ap_hi       < 129.5    to the left,  improve=5583.6270, (0 missing)
##       ap_lo       < 85.5     to the left,  improve=3490.2120, (0 missing)
##       cholesterol splits as  LRR,          improve=1281.9080, (0 missing)
##       age         < 54.5     to the left,  improve=1198.3770, (0 missing)
##       bmi         < 27.43782 to the left,  improve= 728.8057, (0 missing)
##   Surrogate splits:
##       ap_lo       < 84.5     to the left,  agree=0.834, adj=0.583, (0 split)
##       cholesterol splits as  LRR,          agree=0.645, adj=0.109, (0 split)
##       bmi         < 29.66726 to the left,  agree=0.639, adj=0.095, (0 split)
##       age         < 61.5     to the left,  agree=0.615, adj=0.036, (0 split)
##       gluc        splits as  LRR,          agree=0.607, adj=0.016, (0 split)
## 
## Node number 2: 37556 observations,    complexity param=0.01001037
##   predicted class=0  expected loss=0.321653  P(node) =0.6009056
##     class counts: 25476 12080
##    probabilities: 0.678 0.322 
##   left son=4 (22650 obs) right son=5 (14906 obs)
##   Primary splits:
##       age         < 54.5     to the left,  improve=752.7512, (0 missing)
##       cholesterol splits as  LLR,          improve=588.2769, (0 missing)
##       ap_hi       < 118.5    to the left,  improve=208.8289, (0 missing)
##       bmi         < 27.8864  to the left,  improve=139.2997, (0 missing)
##       ap_lo       < 77.5     to the left,  improve=136.1550, (0 missing)
##   Surrogate splits:
##       cholesterol splits as  LLR,          agree=0.618, adj=0.038, (0 split)
##       gluc        splits as  LLR,          agree=0.609, adj=0.014, (0 split)
##       bmi         < 37.80563 to the left,  agree=0.604, adj=0.002, (0 split)
##       ap_lo       < 93.5     to the left,  agree=0.603, adj=0.001, (0 split)
## 
## Node number 3: 24943 observations
##   predicted class=1  expected loss=0.2467626  P(node) =0.3990944
##     class counts:  6155 18788
##    probabilities: 0.247 0.753 
## 
## Node number 4: 22650 observations
##   predicted class=0  expected loss=0.2404415  P(node) =0.3624058
##     class counts: 17204  5446
##    probabilities: 0.760 0.240 
## 
## Node number 5: 14906 observations,    complexity param=0.01001037
##   predicted class=0  expected loss=0.4450557  P(node) =0.2384998
##     class counts:  8272  6634
##    probabilities: 0.555 0.445 
##   left son=10 (13368 obs) right son=11 (1538 obs)
##   Primary splits:
##       cholesterol splits as  LLR,          improve=224.52640, (0 missing)
##       age         < 60.5     to the left,  improve=143.42590, (0 missing)
##       bmi         < 29.27448 to the left,  improve= 51.64790, (0 missing)
##       ap_hi       < 118.5    to the left,  improve= 50.34354, (0 missing)
##       active      splits as  RL,           improve= 38.03618, (0 missing)
##   Surrogate splits:
##       gluc splits as  LLR, agree=0.919, adj=0.218, (0 split)
## 
## Node number 10: 13368 observations
##   predicted class=0  expected loss=0.4156194  P(node) =0.2138914
##     class counts:  7812  5556
##    probabilities: 0.584 0.416 
## 
## Node number 11: 1538 observations
##   predicted class=1  expected loss=0.2990897  P(node) =0.02460839
##     class counts:   460  1078
##    probabilities: 0.299 0.701
# plot tree 
plot(cardiodtfit, uniform=TRUE, main="Classification Tree for cardio")
text(cardiodtfit, use.n=TRUE, all=TRUE, cex=.8)

We also use the caret library to calculate the accuracy and the related statistics from the confusion matrix.

## [1] "Overall: "
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   7.181235e-01   4.351948e-01   7.145781e-01   7.216486e-01   5.061041e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00  1.849779e-239
## [1] "Class: "
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7908697            0.6435791            0.6945416 
##       Neg Pred Value            Precision               Recall 
##            0.7501983            0.6945416            0.7908697 
##                   F1           Prevalence       Detection Rate 
##            0.7395823            0.5061041            0.4002624 
## Detection Prevalence    Balanced Accuracy 
##            0.5762972            0.7172244

The overall accuracy is 71.8%.
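The caret figures can be cross-checked by hand from the leaf counts in the summary(cardiodtfit) output. The sketch below rebuilds the confusion matrix from the four leaves and recomputes accuracy, sensitivity, and specificity in base R; the variable names are our own. Because the counts come from the 62,499 records used to fit the tree, this reproduces an accuracy measured on the fitting data rather than on a held-out set.

```r
# Class counts (class 0, class 1) in the four leaves, copied from the
# summary(cardiodtfit) output; `pred` is each leaf's predicted class.
leaves <- data.frame(
  node = c(3, 4, 10, 11),
  pred = c(1, 0, 0, 1),
  n0   = c(6155, 17204, 7812, 460),
  n1   = c(18788, 5446, 5556, 1078)
)

# Class 0 is treated as the "positive" class, as in the caret output
# (its Prevalence of 0.506 is the share of class 0).
tp <- sum(leaves$n0[leaves$pred == 0])  # actual 0, predicted 0
fn <- sum(leaves$n0[leaves$pred == 1])  # actual 0, predicted 1
tn <- sum(leaves$n1[leaves$pred == 1])  # actual 1, predicted 1
fp <- sum(leaves$n1[leaves$pred == 0])  # actual 1, predicted 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

The three values agree with the Accuracy, Sensitivity, and Specificity entries reported by caret.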


Chapter 11: Conclusion and Discussion

In summary, our analysis indicated that BMI level and smoking are associated with the risk of cardiovascular diseases. Moreover, age, gender, smoking, blood glucose level, and alcohol use have an impact on BMI level.

We further subgrouped the BMI values into the weight classes cataloged by NIH. The plot showed that people with higher BMI values tend to have an increased risk of cardiovascular disease. Another way to measure cardiovascular disease risk is through waist circumference (WC). Abdominal obesity is a well-researched risk factor for CVD, and WC has been suggested as an adjunct to BMI for determining a person’s CVD risk. We predicted the WC value using a published formula and set the parameters for plotting (Bozeman et al., 2012). Our result indicated that the WC value is also a good variable for predicting cardiovascular disease.

There are many risk factors for cardiovascular diseases. Studies suggest that genetic variation among patients has an impact on the development of the diseases, and a family history of cardiovascular disease also increases a person’s risk (Kathiresan & Srivastava, 2012). According to Harvard Health Publishing, the rates of high blood pressure, diabetes, and heart disease vary among people of different races and countries of residence. Therefore, the dataset could include patients’ family background, race, and ethnicity as additional variables for analyzing cardiovascular diseases.

BMI values are positively correlated with cardiovascular disease, and risk factors that contribute to a high BMI value also contribute to the onset of cardiovascular disease. From Chapter 3, we conclude that people who have cardiovascular disease, are female, report no smoking or alcohol over-consumption, or are physically inactive tend to have higher BMI values. From Chapters 2-5, we conclude that elderly people are more likely to have higher BMI values, and that people with higher BMI values have a higher risk of cardiovascular disease. In Chapter 6, we listed three methods for predicting cardiovascular disease and found that the best one uses BMI together with waist circumference. Having these two measurements can help people judge their probability of getting sick according to their physical condition.

For the model building and evaluation, we used logistic regression to analyze the relationship between CVD and other risk factors. The regression model indicated that age, blood pressure, and cholesterol level are correlated with CVD. It is worth mentioning that a higher cholesterol level is strongly associated with an increased risk of CVD. Furthermore, the VIF evaluation suggested that the variables have no multicollinearity, and the model fit was checked with the Hosmer and Lemeshow test.

In our logistic regression, we want to know how the variables influence cardiovascular disease, so we run the full model including gender, age, high blood pressure, low blood pressure, cholesterol level, BMI, glucose level, smoking, alcohol intake, and physical activity. All the coefficients are found significant (small p-values) except gluc2, possibly because gluc2 differs little from gluc1 while gluc3 differs greatly from gluc1. Gender, age, low blood pressure, high blood pressure, cholesterol, and BMI have positive effects on cardiovascular disease (cardio = 1), while smoking, alcohol intake, and physical activity have negative effects. These are reasonable results and confirm our common beliefs.

We use three tools to evaluate the logistic regression: the Hosmer and Lemeshow test, which assesses goodness of fit; the Receiver-Operator-Characteristic (ROC) curve with its Area-Under-Curve (AUC), which plots the true positive rate (sensitivity) against the false positive rate (1 - specificity); and McFadden’s pseudo \(R^2\). First, in the Hosmer and Lemeshow test the p-value is very small. Strictly speaking, a small p-value in this test rejects the null hypothesis that the model fits well, so it points to some lack of fit rather than confirming a good fit; with a sample this large, the test is sensitive to even minor deviations. Second, the area under the curve is 0.7874, slightly less than 0.8, which indicates that the model discriminates reasonably well between patients with and without CVD. Finally, the McFadden value is 0.1890877, which is analogous to the coefficient of determination \(R^2\): only about 18.9% of the variation in cardio is explained by the explanatory variables in the model. Taken together, the three evaluations suggest that this logistic regression is a relatively OK model.
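For readers who want to reproduce two of these measures without extra packages, here is a minimal base R sketch. It uses the built-in `mtcars` data as a stand-in for our cardio data, and the helper name `auc` is our own; McFadden’s value is computed as one minus the ratio of the fitted model’s log-likelihood to that of an intercept-only model, and the AUC uses the rank (Mann-Whitney) formulation.

```r
# Stand-in logistic model on the built-in mtcars data (not our cardio data).
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
null <- glm(am ~ 1,  data = mtcars, family = binomial)

# McFadden's pseudo R^2 = 1 - logLik(fitted model) / logLik(intercept-only).
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
print(mcfadden)

# AUC via ranks: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case.
auc <- function(score, label) {
  r  <- rank(score)          # average ranks handle tied scores
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
print(auc(fitted(fit), mtcars$am))
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect separation, which gives a scale for reading the 0.7874 reported for our model.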

We further used Ridge and Lasso regression to ensure that the model is not overfitting. In Lasso regression, the coefficient for ‘gluc’ is dropped from the model we fitted using logistic regression, which reduces overfitting.

Using KNN with k=7, we could predict the target with an accuracy of 68.5%. k=7 is an efficient choice because accuracy improves noticeably up to that point and the incremental gains trail off afterwards.

For the decision tree, we fit the full model including gender, age, high blood pressure, low blood pressure, cholesterol level, BMI, glucose level, smoking, alcohol intake, and physical activity. First, we visualize the cross-validation results. The complexity parameter (cp) controls the size of the decision tree and is used to select the optimal tree size, so we can prune the tree to avoid overfitting the data. A smaller cp value allows more splits, which lowers the relative error on the data used to fit the tree. From the plot, a tree of size 4 (four terminal nodes from three splits) is the best size for our model, and the tree uses only three variables: high blood pressure, age, and cholesterol. Using the caret library, we can see that the overall accuracy is 71.8%, so it is an OK model.

In our fancyPlot, the most important variable is high blood pressure, the second is age, and the third is cholesterol. The percentages in the plot are the share of all patients that fall into each leaf. Patients with high blood pressure of 130 or more make up 40% of the data and are predicted to have cardiovascular disease (75% of them do). Patients with high blood pressure below 130 and age below 55 make up 36% and are predicted not to have cardiovascular disease (76% of them do not). Patients with high blood pressure below 130, age 55 or older, and cholesterol level 1 or 2 make up 21% and are predicted not to have cardiovascular disease (58% of them do not). Patients with high blood pressure below 130, age 55 or older, and cholesterol level 3 make up 2% and are predicted to have cardiovascular disease (70% of them do). The result showed that blood pressure, age, and cholesterol level are important indicators for determining CVD. This result is consistent with the logistic regression analysis, which showed that blood pressure and cholesterol level are correlated with CVD.

Overall, this CVD database provides useful variables for us to conduct several chi-square tests, regression analysis, and model building.

References:

Benjamin, E. J., Muntner, P., Alonso, A., Bittencourt, M. S., Callaway, C. W., Carson, A. P., . . . Stroke Statistics, S. (2019). Heart Disease and Stroke Statistics-2019 Update: A Report From the American Heart Association. Circulation, 139(10), e56-e528. doi:10.1161/CIR.0000000000000659

Bozeman, S. R., Hoaglin, D. C., Burton, T. M., Pashos, C. L., Ben-Joseph, R. H., & Hollenbeak, C. S. (2012). Predicting waist circumference from body mass index. BMC Med Res Methodol, 12, 115. doi:10.1186/1471-2288-12-115

Cardiovascular Disease [Web log post]. Retrieved Oct 12, 2019, from https://www.who.int/health-topics/cardiovascular-diseases/

Kathiresan, S., & Srivastava, D. (2012). Genetics of human cardiovascular disease. Cell, 148(6), 1242-1257. doi:10.1016/j.cell.2012.03.001